Maerl DNA was extracted using the Qiagen Blood & Tissue Kit and sent to SNPsaurus for whole-genome genotyping. Libraries were sequenced as paired-end 2 × 150 bp reads on a NovaSeq 6000 S4 lane. The raw reads were filtered using fastp.
Rename all raw reads
The names of all raw reads provided began with "TJ-". This was removed using the rename command.
rename 's/TJ-//' *.fastq.gz
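The command above is the Perl `rename`. On systems that ship the util-linux `rename` instead (where the `s/…/…/` syntax fails), an equivalent pure-bash loop is:

```shell
# Strip the "TJ-" prefix from every fastq.gz file using only bash
# parameter expansion, for systems without the Perl rename utility.
for f in TJ-*.fastq.gz; do
    [ -e "$f" ] || continue   # skip the loop if nothing matches the glob
    mv -- "$f" "${f#TJ-}"     # ${f#TJ-} removes the leading "TJ-"
done
```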
Print the number of raw reads per sample in parallel
This code prints the names of all fastq.gz files, calculates the number of reads per sample, and prints both the sample name and the output.
ls *fastq.gz | parallel --keep-order 'echo -n "{} "; zcat {} | grep -c "^@NGS"' > rawread_counts.txt
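Counting with `grep -c "^@NGS"` depends on every read header beginning with `@NGS`. A header-independent sketch instead divides the decompressed line count by four, since a FASTQ record is always exactly four lines:

```shell
# Count reads per fastq.gz without depending on the header prefix:
# a FASTQ record is always 4 lines, so reads = lines / 4.
for f in *fastq.gz; do
    printf '%s %d\n' "$f" "$(( $(zcat "$f" | wc -l) / 4 ))"
done > rawread_counts.txt
```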
Clean raw reads using Fastp
First, prepare a bash script called runFastp.sh that accepts a sample of paired reads and executes Fastp.
#!/bin/sh
PROC=4
R1_FQ="$1"
R2_FQ="$2"
outDIR=/data2/tjenks/maerl_genomics/clean_reads
rawDIR=/data2/tjenks/maerl_genomics/raw_reads
fastp -i $rawDIR/${R1_FQ} -I $rawDIR/${R2_FQ} -o $outDIR/${R1_FQ%_*}.fastq.gz -O $outDIR/${R2_FQ%_*}.fastq.gz -q 20 --trim_poly_x --length_required 100 --thread ${PROC}
| Parameter | Description |
|---|---|
| -q 20 | base phred quality >= 20 |
| --trim_poly_x | perform both polyG (enabled by default) and polyX tail trimming |
| --length_required 100 | discard reads < 100 bp |
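The output names in the fastp command rely on the bash parameter expansion `${R1_FQ%_*}`, which deletes the shortest suffix matching `_*`, i.e. the last underscore and everything after it. A quick sketch with a hypothetical filename:

```shell
# ${VAR%_*} deletes the shortest trailing match of "_*",
# i.e. the last underscore and everything after it.
R1_FQ="Sample01_R1_001.fastq.gz"
echo "${R1_FQ%_*}"             # -> Sample01_R1
echo "${R1_FQ%_*}.fastq.gz"    # -> Sample01_R1.fastq.gz (the fastp output name)
```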
Second, navigate to the directory containing the raw reads and run the following code. This writes another bash script which sets up the input files for the runFastp.sh script.
paste <(ls -1 *R1_001.fastq.gz) <(ls -1 *R2_001.fastq.gz) | awk '{print "bash runFastp.sh", $1, $2}' > ../clean_reads/clean_reads.sh
Finally, run the clean_reads.sh script.
bash clean_reads.sh
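To see what the generated clean_reads.sh contains, here is the paste/awk step run on two hypothetical sample names (S1, S2 are illustrative only):

```shell
# Demonstrate the script-generation step on dummy read files.
cd "$(mktemp -d)"
touch S1_R1_001.fastq.gz S1_R2_001.fastq.gz S2_R1_001.fastq.gz S2_R2_001.fastq.gz
paste <(ls -1 *R1_001.fastq.gz) <(ls -1 *R2_001.fastq.gz) |
    awk '{print "bash runFastp.sh", $1, $2}'
# bash runFastp.sh S1_R1_001.fastq.gz S1_R2_001.fastq.gz
# bash runFastp.sh S2_R1_001.fastq.gz S2_R2_001.fastq.gz
```

Because `ls -1` sorts both file lists identically, each line pairs a sample's R1 file with its matching R2 file.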
Print the number of high quality clean reads per sample
ls *fastq.gz | parallel --keep-order 'echo -n "{} "; zcat {} | grep -c "^@NGS"' > cleanread_counts.txt
| No. of samples | Total raw reads | Total HQ reads | % Retained |
|---|---|---|---|
| 95 | 1,158,303,200 | 1,101,098,746 | 95.1 |
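As a sanity check, the percentage retained follows directly from the raw and high-quality read totals:

```shell
# Percentage of raw reads retained after fastp filtering.
awk 'BEGIN { printf "%.1f\n", 1101098746 / 1158303200 * 100 }'
# -> 95.1
```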
Extract the mitochondrial and plastid genomes from the cleaned reads using the programs GetOrganelle and Unicycler. For a guide to analysing NGS-derived organelle genomes, see Song et al. (2016). Each extracted organelle was compared against the NCBI GenBank database using BLASTn to check that it matched other maerl species.
Software
GetOrganelle 1.6.4
Bowtie2 2.4.1 (ensure that the path to Bowtie2 version 2.4.1 is specified because GetOrganelle failed with previous versions)
Unicycler 0.4.9b
SPAdes 3.14.0
BLAST+ 2.9.0
Bandage
Download DNA sequence data to use as seeds
| Species | GenBank ID | Organelle | Locus | Length (bp) |
|---|---|---|---|---|
| P. calcareum | KF808323 | Mitochondrion | COI partial sequence | 651 |
| Lithothamnion sp. | MH281621 | Mitochondrion | Complete genome | 25,605 |
| L. corallioides | KC861502 | Mitochondrion | COI partial sequence | 664 |
| L. glaciale | HM918977 | Mitochondrion | COI partial sequence | 664 |
Extract mitogenome from one sample
First, extract target reads using GetOrganelle.
get_organelle_from_reads.py -1 read1.fq.gz -2 read2.fq.gz -o output_dir -s Lithothamnion_mitogenome.fasta,Pcalcareum_COI.fasta -F embplant_mt,embplant_mt -t 8 --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ -R 50 --no-spades
| Parameter | Description |
|---|---|
| -1, -2 | forward read, reverse read |
| -s | sequences for initial seed |
| -F | target organelle type (embplant_mt, animal_mt, fungus_mt) |
| -R 50 | maximum number of extension rounds |
| --no-spades | do not assemble reads using SPAdes |
Second, assemble the target reads using Unicycler.
unicycler -1 filtered_1_paired.fq -2 filtered_2_paired.fq -o unicycler_assembly --mode conservative
Lastly, view the assembled contig graph (*.gfa) in Bandage to check for circularity and BLAST the sequence on the NCBI nucleotide database to check sequence similarity.
# Example of a complete circular mitogenome from Tre04 sample
from IPython.display import Image
Image("Jupyter_files/Tre04_Bandage_graph.png", width=500, height=500)
Extract mitogenomes for all samples using bash script
First, prepare a bash script called runGetMitogenome.sh that accepts a sample of paired reads, executes GetOrganelle, and then assembles the target reads using Unicycler.
#!/bin/sh
PROC=8
R1_FQ="$1"
R2_FQ="$2"
readDIR=/data2/tjenks/maerl_genomics/clean_reads
seedDIR=/data2/tjenks/maerl_genomics/get_organelle/seed_sequences
get_organelle_from_reads.py -1 $readDIR/${R1_FQ} -2 $readDIR/${R2_FQ} -o ${R1_FQ%_*} -s $seedDIR/Lithothamnion_mitogenome.fasta,$seedDIR/Pcalcareum_COI.fasta -F embplant_mt,embplant_mt -t ${PROC} --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --prefix ${R1_FQ%_*} -R 50 --no-spades
cd ${R1_FQ%_*}
unicycler -1 ${R1_FQ%_*}filtered_1_paired.fq -2 ${R1_FQ%_*}filtered_2_paired.fq -o unicycler_assembly --mode conservative
Second, navigate to the directory containing the clean reads and run the following code which writes another bash script that sets up the input files to runGetMitogenome.sh.
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash runGetMitogenome.sh", $1, $2}' > ../dir_path/runBash.sh
Finally, execute the runBash.sh script.
bash runBash.sh
Notes:
When the organelle genome was manually extracted from the assembly graph in Bandage (on my Windows laptop), the resulting FASTA file was saved with Windows (CRLF) line endings. These must be converted to Unix (LF), e.g. in Notepad++, for the file to work with some Unix programs.
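A shell alternative to Notepad++ is to strip the carriage returns directly on the server (GNU sed shown; `dos2unix` also works where installed, and BSD sed needs `-i ''`). The filename is illustrative:

```shell
# Convert Windows (CRLF) line endings to Unix (LF) in place.
sed -i 's/\r$//' organelle.fasta
```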
Rotate sequences to start at the same position
To rotate all sequences to start at the same position, run the program MARS (Multiple circular sequence Alignment using Refined Sequences) which takes a FASTA file of sequences as input.
mars -a DNA -i input.fasta -o rotated.fasta
Download DNA sequence data to use as seeds
| Species | GenBank ID | Organelle | Locus | Length (bp) |
|---|---|---|---|---|
| P. calcareum | KC819266 | Plastid | psbA partial sequence | 889 |
| P. calcareum | MH274809 | Plastid | rbcL partial sequence | 1,322 |
| Lithothamnion sp. | MH281627 | Plastid | Complete genome | 183,822 |
| L. corallioides | KC819265 | Plastid | psbA partial sequence | 888 |
| L. glaciale | KC819271 | Plastid | psbA partial sequence | 853 |
Extract plastome from one sample
First, extract target reads using GetOrganelle.
get_organelle_from_reads.py -1 read1.fq.gz -2 read2.fq.gz -o output_dir -s Lithothamnion_plastome.fasta,Pcalcareum_psbA_rbcL.fasta -F embplant_pt,embplant_pt -t 8 --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --no-spades
| Parameter | Description |
|---|---|
| -1, -2 | forward read, reverse read |
| -s | sequences for initial seed |
| -F | target organelle type (e.g. embplant_pt, embplant_mt, other_pt) |
| --no-spades | do not assemble reads using SPAdes |
Second, assemble the target reads using Unicycler.
unicycler -1 filtered_1_paired.fq -2 filtered_2_paired.fq -o unicycler_assembly --mode conservative
Lastly, view the assembled contig graph (*.gfa) in Bandage to check for circularity and BLAST the sequence on the NCBI nucleotide database to check sequence similarity.
# Example of a complete circular plastome from Arm01 sample
from IPython.display import Image
Image("Jupyter_files/Arm01_Bandage_graph.png", width=500, height=500)
Extract plastomes for all samples using bash script
First, prepare a bash script called runGetPlastome.sh that accepts a sample of paired reads, executes GetOrganelle, and then assembles the target reads using Unicycler.
#!/bin/sh
PROC=8
R1_FQ="$1"
R2_FQ="$2"
readDIR=/data2/tjenks/maerl_genomics/clean_reads
seedDIR=/data2/tjenks/maerl_genomics/get_organelle/seed_sequences
get_organelle_from_reads.py -1 $readDIR/${R1_FQ} -2 $readDIR/${R2_FQ} -o ${R1_FQ%_*} -s $seedDIR/Lithothamnion_plastome.fasta,$seedDIR/Pcalcareum_psbA_rbcL.fasta -F embplant_pt,embplant_pt -t ${PROC} --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --prefix ${R1_FQ%_*} --no-spades
cd ${R1_FQ%_*}
unicycler -1 ${R1_FQ%_*}filtered_1_paired.fq -2 ${R1_FQ%_*}filtered_2_paired.fq -o unicycler_assembly --mode conservative
Second, navigate to the directory containing the clean reads and run the following code which writes another bash script that sets up the input files to runGetPlastome.sh.
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash runGetPlastome.sh", $1, $2}' > ../dir_path/runBash.sh
Finally, execute the runBash.sh script.
bash runBash.sh
Notes:
When the organelle genome was manually extracted from the assembly graph in Bandage (on my Windows laptop), the resulting FASTA file was saved with Windows (CRLF) line endings. These must be converted to Unix (LF), e.g. in Notepad++, for the file to work with some Unix programs.
Rotate sequences to start at the same position
To rotate all sequences to start at the same position, run the program MARS (Multiple circular sequence Alignment using Refined Sequences) which takes a FASTA file of sequences as input.
mars -a DNA -i input.fasta -o rotated.fasta
After extracting an organelle contig, the organelle sequence was used as input to annotation software, which predicts and annotates the genes captured in the organelle assembly. For red algal mitochondria, the NCBI genetic code used for prediction and annotation appears to be transl_table 4 (the Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code). For red algal plastid genomes, it appears to be transl_table 11 (the Bacterial, Archaeal and Plant Plastid Code). For a recent study on the evolutionary histories of red algal organelles, see Lee et al. (2018).
The Genetic Code - Khan Academy
Software
GeSeq - annotation of organelle genomes, in particular chloroplast genomes.
MITOS - annotation of metazoan mitochondrial genomes.
MFannot - annotation of organelle genomes (helpful for genomes with many introns, outputs protein sequences).
OGDRAW - visualisation of circular genome
GenomeVx - visualisation of circular genome
Note
After examining the output of both BLASTn and GeSeq, I noticed that some samples had the mitogenome or plastome sequence in reverse orientation, which explained the large, otherwise unexplained differences in pairwise identity between two groups of samples in the ClustalW and Mafft alignments. To get around this, I took the reverse complement of one sequence 'type' (which I called type B) and then re-annotated the reversed sequences so that the gene coding sequences ran in the same direction for every sample.
Mitogenome sequences were annotated using GeSeq. The Lithothamnion sp. GenBank record was used as a reference, the Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code was used as the genetic code, and tRNAs were annotated using ARWEN 1.2.3.
# Import annotation of Tre04 mitogenome
from IPython.display import Image
Image("Jupyter_files/Tre04_mitogenome_GeSeq.jpg", width = 800, height = 800)
The coding sequences (CDS) output from GeSeq for each sample were imported into Geneious 6.1.8. The CDS were sorted alphabetically and concatenated to generate one sequence per sample containing 24 CDS.
# Import csv file
import pandas as pd
align_stats = pd.read_csv("Jupyter_files/mitogenome_cds.csv")
# Print all rows of the data
pd.set_option('display.max_rows', None)
align_stats
Align all mitogenome CDS
A FASTA file containing the concatenated CDS for each sample was exported and alignment was conducted in Unix using Mafft following the on-screen instructions.
module load MAFFT/7.305-foss-2018b-with-extensions
mafft
The alignment was then re-imported into Geneious and a UPGMA tree was built using Tamura-Nei genetic distances. Two alignments and trees were generated: (1) Only Phymatolithon calcareum samples, (2) All maerl samples (including Scotland Phymatolithon sp. and Lithothamnion sp.).
Plastome sequences were annotated using GeSeq. The Lithothamnion sp. GenBank record was used as a reference, the Bacterial, Archaeal and Plant Plastid Code was used as the genetic code, and tRNAs were annotated using ARAGORN v1.2.38.
# Import annotated plastome of Tre04
from IPython.display import Image
Image("Jupyter_files/Tre04_plastome_GeSeq.jpg", width = 800, height = 800)
The coding sequences (CDS) output from GeSeq for each sample were imported into Geneious 6.1.8. The CDS were sorted alphabetically and concatenated to generate one sequence per sample containing 24 CDS.
Align all plastome CDS
A FASTA file containing the concatenated CDS for each sample was exported and alignment was conducted in Unix using Mafft following the on-screen instructions.
module load MAFFT/7.305-foss-2018b-with-extensions
mafft
The alignment was then re-imported into Geneious and a UPGMA tree was built using Tamura-Nei genetic distances. Two alignments and trees were generated: (1) Only Phymatolithon calcareum samples, (2) All maerl samples (including Scotland Phymatolithon sp. and Lithothamnion sp.).
The FASTA alignments were imported into R for analysis and tree building.
Sequence reads were trimmed using bbduk (BBMap tools), and a de novo reference was assembled from one sample (Mor02) with abyss-pe using default parameters. The contigs of the reference assembly were run through BLASTn, and any contigs with hits to bacteria or other contaminating species were removed. (This step was done by SNPsaurus.)
Align reads using bwa-mem2
Index reference
bwa-mem2 index assembly_Mor02_final.fasta
Prepare a bash script that aligns clean reads to the reference using bwa-mem2
#!/bin/sh
PROC=6
R1_FQ="$1"
R2_FQ="$2"
ref=/data2/tjenks/maerl_genomics/assembly_pcal/clean_maerl_Mor02_250.fa
clean=/data2/tjenks/maerl_genomics/clean_reads
bwa-mem2 mem -t ${PROC} $ref $clean/${R1_FQ} $clean/${R2_FQ} > ${R1_FQ%_*}.sam
samtools view --threads ${PROC} -b ${R1_FQ%_*}.sam > ${R1_FQ%_*}.bam
samtools sort --threads ${PROC} ${R1_FQ%_*}.bam -o ${R1_FQ%_*}.sorted.bam
rm ${R1_FQ%_*}.sam
rm ${R1_FQ%_*}.bam
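An alternative sketch of the alignment step pipes the aligner output straight into `samtools sort`, avoiding the intermediate SAM and unsorted BAM files (and the two `rm` steps). This assumes the same tools and variables as the script above:

```shell
# Pipe bwa-mem2 output directly into samtools sort; no .sam or
# unsorted .bam intermediates are written to disk.
bwa-mem2 mem -t ${PROC} $ref $clean/${R1_FQ} $clean/${R2_FQ} \
    | samtools sort --threads ${PROC} -o ${R1_FQ%_*}.sorted.bam -
```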
Navigate to the directory containing the clean reads and run the following command to prepare the input files
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash x_align.sh", $1, $2}' > ../12.bwa-mem2_alignments/y_input.sh
Execute by running bash y_input.sh
Filter alignments and generate stats
Count and print the mapping quality (MAPQ) scores for the alignments (the highest MAPQ reported by bwa-mem2 is 60)
samtools view input.bam | cut -f 5 | sort -n | uniq -c
Filter alignments
for file in *.bam; do filename="${file%%.*}"; samtools view --threads 6 -f 0x2 -bq 30 $file > ${filename}.sorted.filt.bam; done
for file in *.filt.bam; do filename="${file%%.*}"; samtools flagstat --threads 6 $file > ${filename}.sorted.filt.bam.stats; done
| Parameter | Description |
|---|---|
| -f 0x2 | only retain alignments in which paired reads properly aligned |
| -q 30 | only include reads with mapping quality >= 30 |
| -b | output BAM |
Count the number of mapped reads for each sample
cat *.stats | grep "mapped (" | awk '{print $1}' > reads_mapped_to_reference
Alignment stats
# Import csv file
import pandas as pd
align_stats = pd.read_csv("Jupyter_files/sequencing_alignment_stats.csv")
# Print all rows of the data
pd.set_option('display.max_rows', None)
align_stats
1. Call variants using bcftools
Create alignment list file
ls -1 ../12.bwa-mem2_alignments/*.filt.bam > alignment_list
Run bcftools mpileup and pipe the output to bcftools call
bcftools mpileup --threads 6 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../assembly_pcal/assembly_Mor02_final.fasta --bam-list alignment_list | bcftools call --threads 6 --variants-only --multiallelic-caller -Ov -o variant_calls.vcf
Create a file containing old vcf header and new vcf headers
paste <(cat alignment_list) <(ls -1 ../12.bwa-mem2_alignments/*.filt.bam | grep -o -P ".{0,5}.sorted" | sed 's/.sorted//g') > rename_samples
Edit file to add L in front of Lcor samples.
Rename vcf headers
bcftools reheader --samples rename_samples -o variant_calls.vcf.rename variant_calls.vcf
2. Call variants using bcftools with ploidy model
Create samples list file with ploidy
ls -1 ../raw_reads/*fastq.gz | awk '{print $0, "\t", "3"}' > ploidy_list.txt
Run bcftools mpileup and pipe the output to bcftools call
bcftools mpileup --threads 6 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../assembly_pcal/clean_maerl_Mor02_250.fa --bam-list alignment_list | bcftools call --threads 6 --variants-only --multiallelic-caller -Ov -o variant_calls.vcf --samples-file ploidy_list.txt
ISCA
Call variants using bcftools
module load BCFtools/1.9-foss-2018b
bcftools mpileup --threads 16 -Ou -f ../assembly_Mor02_final.fasta --bam-list alignments_list.txt | bcftools call --threads 16 -mv -Ov -o variant_calls.vcf
[Time: 14:13:01]
TEST: Call variants using bcftools with ploidy model
module load BCFtools/1.9-foss-2018b
bcftools mpileup --threads 16 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../../assembly_Mor02_final.fasta --bam-list ../alignments_list.txt | bcftools call --threads 16 --samples-file ploidy_list.txt --variants-only --multiallelic-caller -Ov -o variant_calls.vcf
[Time: ]
Convert bcf to vcf (if required)
bcftools view --output-type v variant_calls.bcf > variant_calls.vcf
Filter vcf using vcftools
Filter 1 (reduce dataset)
vcftools --vcf variant_calls.vcf --out filter1 --recode --recode-INFO-all --max-missing 0.70 --minQ 30
| Filtering step | Number of SNPs |
|---|---|
| Start | XXX |
| --max-missing 0.70 --minQ 30 | XXX |
Filter 2 (remove individuals with missing data)
Assess individual missing data
vcftools --vcf filter1.recode.vcf --missing-indv --out missingdata
cat missingdata.imiss
Create a list of individuals with more than 20% missing data
awk 'NR > 1 && $5 > 0.20' missingdata.imiss | cut -f 1 > missingdata-greater20
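The F_MISS column (column 5) of the .imiss file drives this filter; without `NR > 1`, the header row's non-numeric "F_MISS" field string-compares greater than 0.20 and leaks "INDV" into the remove list. A toy example with hypothetical sample names:

```shell
# Build a minimal vcftools .imiss file: INDV and F_MISS (col 5) matter here.
cd "$(mktemp -d)"
printf 'INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n' >  missingdata.imiss
printf 'Arm01\t1000\t0\t50\t0.05\n'                           >> missingdata.imiss
printf 'Tre04\t1000\t0\t350\t0.35\n'                          >> missingdata.imiss
# Keep data rows (NR > 1) whose missingness exceeds 20%.
awk 'NR > 1 && $5 > 0.20' missingdata.imiss | cut -f 1
# -> Tre04
```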
Remove individuals from data set
vcftools --vcf filter1.recode.vcf --out filter2 --recode --recode-INFO-all --remove missingdata-greater20
Filter 3 (all filters)
vcftools --vcf filter2.recode.vcf --out filter3 --recode --recode-INFO-all --max-missing 0.97 --minDP 5 --maxDP 100 --min-alleles 2 --max-alleles 2 --remove-indels --mac 5
| VCFtools filtering step | Number of SNPs |
|---|---|
| --max-missing 0.97 | |
| --minDP 5 | |
| --maxDP 100 | |
| --min-alleles 2 --max-alleles 2 | |
| --remove-indels | |
| --mac 5 | XXX |
Filter 4 (linkage disequilibrium)
vcftools --vcf filter3.recode.vcf --geno-r2 --min-r2 0.7
Tabular result:
| CHR | POS1 | POS2 | N_INDV | R^2 |
|---|---|---|---|---|
| tig_32046 | 6390 | 6399 | 66 | 1 |
| tig_76165 | 173 | 1058 | 64 | 1 |
| tig_76165 | 173 | 2675 | 65 | 1 |
| tig_76165 | 1058 | 2675 | 64 | 0.84 |
| tig_...n | x | x | x | x |
Extract the unique values of POS1 (column 2), each representing one of the loci in LD to remove from the dataset, and redirect only the CHR and POS1 columns to a text file.
cat out.geno.ld | awk '!seen[$2]++' | grep -v "^CHR" | cut -f 1,2 > loci_in_ld.txt
wc -l loci_in_ld.txt
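The `!seen[$2]++` awk idiom keeps only the first row for each POS1 value. A toy version of the out.geno.ld parsing, using rows from the table above:

```shell
# Build a toy out.geno.ld: CHR, POS1, POS2, N_INDV, R^2.
cd "$(mktemp -d)"
printf 'CHR\tPOS1\tPOS2\tN_INDV\tR^2\n'    >  out.geno.ld
printf 'tig_76165\t173\t1058\t64\t1\n'     >> out.geno.ld
printf 'tig_76165\t173\t2675\t65\t1\n'     >> out.geno.ld
printf 'tig_76165\t1058\t2675\t64\t0.84\n' >> out.geno.ld
# !seen[$2]++ is true only the first time each POS1 value appears;
# grep -v then drops the header row before cutting CHR and POS1.
cat out.geno.ld | awk '!seen[$2]++' | grep -v "^CHR" | cut -f 1,2
# tig_76165	173
# tig_76165	1058
```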
Remove LD loci from data set
vcftools --vcf filter3.recode.vcf --out filter4 --recode --recode-INFO-all --exclude-positions loci_in_ld.txt
| VCFtools filtering step | Number of SNPs |
|---|---|
| --geno-r2 (r2 > 0.7) | XX |
Extract variants between P. calcareum and L. corallioides
module load VCFtools/0.1.16-foss-2018b-Perl-5.28.0
vcftools --vcf ../variant_calls.vcf --out ./hybrid1 --recode --recode-INFO-all --keep hybrid_ind_to_keep.txt
vcftools --recode --recode-INFO-all --vcf hybrid1.recode.vcf --out hybrid2 --minQ 30 --max-missing 0.90 --min-alleles 2 --max-alleles 2